IBM Support

AFS Disaster Recovery

Product Documentation


Abstract

The goal of this paper is to help you prepare for the total loss of
your AFS cell in the event of a disaster. It describes the AFS
meta-data that needs to be saved (the AFS data itself, the volumes, is
assumed to be saved on AFS backup tapes or as volume dumps). Three
disaster scenarios are then presented, with steps on how to recover
from each. Finally, a number of disaster recovery issues are
discussed, along with tools to help the system administrator.

Content





AFS Disaster Recovery

John Morin
Advanced Member of Technical Staff







Introduction

The potential for total loss of an AFS cell looms over the head of every system administrator. It is important to plan for a disaster in which all local machines, data, and backup tapes are lost. Recovery of data that has been stored offsite is then critical for the survival of your company.

The goal of this paper is to help you prepare for the total loss of your AFS cell in the event of a disaster. It describes the AFS meta-data that needs to be saved (the AFS data itself, the volumes, is assumed to be saved on AFS backup tapes or as volume dumps). Three disaster scenarios are then presented, with steps on how to recover from each. Finally, a number of disaster recovery issues are discussed, along with tools to help the system administrator.





Non-Goals

It is not a goal of this paper to help the administrator determine which data and applications are critical to the success of the company; where or how the information is stored offsite; how often the data is stored offsite; how to retrieve the data in the event of a disaster; or which machines the data will be restored to. These are details that must be determined in a general disaster recovery plan.





Software and Hardware Requirements


AFS Meta-Data

The following meta-data is needed to begin recovery of data from AFS backup tapes. This information cannot be stored in an AFS volume nor backed up to an AFS tape because it is needed before any data can be restored.

The heart of an AFS cell is the UBIK databases: volume location database (VLDB), authentication database (KADB), protection database (PRDB), and backup database (BUDB). Without them, recovery of a cell would be tedious. Fortunately, most of the AFS meta-data is platform independent. This means the information can be restored to any machine type or OS regardless of what machine it was saved from.

  • The AFS binaries as distributed by Transarc: One for each type of machine you have within your restore cell.
  • A copy of the UBIK databases: all /usr/afs/db/*.DB0 files. These files are platform independent. When saving the UBIK databases, be sure that the UBIK server processes are not running on the machine you are saving from.
  • A copy of the AFS cell information directory: The entire /usr/afs/etc/ directory. These files are platform independent.
  • A copy of the client CellServDB file: /usr/vice/etc/CellServDB. This file is platform independent.
  • A copy of /etc/passwd and associated files (i.e., /etc/shadow). These files are machine dependent.
  • Other AFS data (machine/platform dependent)
  • /usr/afs/backup/tapeconfig (for reference of tape settings), /usr/afs/backup/CFG_* (for reference of tape configuration), and any backup (un)mount scripts for automated backup and recovery.
  • /etc/rc/afs or equivalent AFS startup script.
  • /usr/afs/local/BosConfig (for reference of server options only).
  • /etc/pam.conf: HP-UX 11.0 and later, and Solaris 2.6 and later.
  • Server preferences in clients.
  • For volumes that are not restored but re-created, their name, initial quota, and ACL.
  • Other non-AFS data.

Hardware

Whether you are restoring a critical subset of the data or the entire cell, be sure you have sufficient hardware, particularly disk space, for the database servers, file servers, and clients.

Before recovery can proceed from backup tapes, the same type of tape drive used to save the data is needed for restores.

Incorrect configuration of the tape drive can also prevent reading AFS tapes. The tape compression, density, and block size need to be configured correctly.

Saving AFS Meta-Data

When you create the backups of the volumes and of the UBIK meta-data affects how easily the data can be recovered. The volume dumps should be saved first and then the AFS meta-data, and the two should be saved as close in time to each other as possible.
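As a rough sketch of saving this meta-data, the files on one database server might be captured with commands like the following. The host name is a placeholder, and how the resulting archive is moved offsite is left to your general disaster recovery plan. Stopping the database server processes on that one machine keeps the .DB0 files quiescent while they are copied; the remaining database servers maintain quorum.

 # db1.example.com is a placeholder for one of your database server machines.
 bos shutdown db1.example.com buserver kaserver ptserver vlserver -wait
 tar cvf /tmp/afs-metadata.tar /usr/afs/db /usr/afs/etc /usr/vice/etc/CellServDB
 bos start db1.example.com buserver kaserver ptserver vlserver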





Recovering Fileservers

In this scenario, one or more file servers were lost. The cell, though, is not down. The database servers are still up and the remaining file servers are up. The goal is to recover the lost file servers as quickly as possible. The newly recovered file servers may have different IP addresses.

You may either replace the existing file servers or restore the data from the lost file servers to the remaining file servers within the cell. In both cases, the commands to restore the data will remain the same.

Recovering File Server Machines

For each file server machine:


1. Install AFS as a file server machine.
2. Install a new /usr/afs/etc/ directory. You can copy or ftp the directory from another server machine within the cell. Do not copy any other /usr/afs/ directories.
3. Start the bosserver and create the fs instance.
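As a sketch of step 3, the fs instance could be created with a command like the following. The host and cell names are placeholders, and the paths assume the default /usr/afs/bin installation location.

 # fs1.example.com and example.com are placeholder names.
 bos create fs1.example.com fs fs \
     /usr/afs/bin/fileserver /usr/afs/bin/volserver /usr/afs/bin/salvager \
     -cell example.com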

Recovering Volume Data

Prior to restoring from AFS backup, you may need to configure the backup subsystem to use the new machines and tape drives (if the backup machine was lost).


1. Create/install /usr/afs/backup/tapeconfig on the backup tape machine as well as any other configuration files CFG_* and (un)mount scripts for automated backup and recovery.
2. Update the host list within the backup database (backup listhosts, delhost, and addhost) so that the desired port offset points to the correct backup tape machine (see the sketch after these steps).
3. Do not update volumeset entries yet. They are used to expand volumesets during restores (backup volsetrestore).
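For example, the tape host entries might be updated with commands like these. The host names are placeholders, and port offset 0 is assumed to match the entry in tapeconfig.

 # oldbackup/newbackup are placeholder host names.
 backup listhosts
 backup delhost -tapehost oldbackup.example.com -portoffset 0
 backup addhost -tapehost newbackup.example.com -portoffset 0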

Two AFS backup commands are useful in recovering data that was on a file server. They are backup diskrestore and backup volsetrestore.

  • backup diskrestore
  • The backup diskrestore command will restore a single AFS vice partition. Diskrestore has the following advantages.

    • You can easily specify an alternate server and partition to restore the information to. This is useful for restoring a partition to a spare vice partition on another server.
    • You can start more than one diskrestore operation thus making use of multiple tape drives restoring data. Depending on how volumesets are dumped, the tapes may need to be shared across the different restore operations.
  • backup volsetrestore
  • The backup volsetrestore command takes a volumeset as an argument and expands it into a list of volumes to restore. For instance, to restore a single partition (like diskrestore), a volumeset would be created with the following entry:

           server             partition  volume
          afs3.transarc.com      b        .*
    To restore all volumes from two servers, a volumeset would contain the following two entries:
            server            partition  volume
           afs3.transarc.com     .*        .*
           afs4.transarc.com     .*        .*
    Volsetrestore has the following advantages and disadvantages:
    • By defining a volumeset to restore, you can group a large number of volumes into a single restore command. Restoring the volumes together minimizes the tape changes that multiple separate restores may require.
    • You can eliminate sharing tapes across different restores by restoring the volumesets used to perform dumps.
    • Volsetrestore has no -newserver or -newpartition options (diskrestore has these options). However, volsetrestore can read a file of entries (each entry containing a volume name, the server name to restore to, and the partition name to restore to) and restore them accordingly. The -n flag to volsetrestore will generate a list to restore, which you can then modify and pass back to volsetrestore with the -file option, as shown in the sketch after this list.
    • All volumes to restore are restored from a single tape drive. To make use of different tape drives, you must start more than one restore.
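A sketch of the -n/-file workflow described above; the volumeset name and file path are placeholders.

 # "user" is a placeholder volumeset name.
 backup volsetrestore -name user -n > /tmp/user.restores
 # Edit the list to set the target server and partition for each volume
 # (the -n output may need light editing into the format -file expects), then:
 backup volsetrestore -file /tmp/user.restores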

Notes

Both backup diskrestore and backup volsetrestore rely on the VLDB to determine what volumes need to be restored (except for the -file option to volsetrestore). Do not run vos syncvldb or vos syncserv when recovery is in progress.

Both commands also rely on the BUDB to determine which backup tapes are needed.

Both commands restore readwrite volumes only. Backup and readonly volumes are not recreated. Backup volumes are recreated on the next vos backup or vos backupsys. Readonly volumes on the lost servers need to be removed with vos remsite and optionally recreated on a new server (see the section Restoring Volumes and Backup Data).





Recovering Database Servers

In this scenario one or more (but not all) database servers were lost. The remaining database servers are up but may not be in quorum. The cell remains up and data is accessible. The goal is to get the down database servers up and in quorum as quickly as possible.

One or both of the following approaches can be used to get the database servers back up:


1. Remove the lost database server machines from the CellServDB file. You lose the replicated database server site, but this is useful for quickly getting the remaining database servers back in quorum.
2. Restore the lost database servers. The restored database server machines may or may not have the same IP addresses.

Removing Database Servers


1. Remove the lost database server machines from the /usr/afs/etc/CellServDB and /usr/vice/etc/CellServDB files.
2. Distribute the changes to the other database servers, file servers, and clients.
3. Restart the database servers so they will recognize the new list of database servers.
4. Reboot the client machines or use fs newcell.
5. Restart the file servers to recognize the new list of database servers. File server restarts can be deferred until an appropriate time.
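A rough sketch of these steps with placeholder host names, where db2.example.com is the lost database server, db1.example.com is a surviving database server, and fs1.example.com is a file server:

 # Remove the lost host from the server-side CellServDB on each remaining server.
 bos removehost db1.example.com db2.example.com
 bos removehost fs1.example.com db2.example.com
 # Restart the database server processes so they re-establish quorum.
 bos restart db1.example.com kaserver ptserver vlserver buserver
 # Update a running client without a reboot.
 fs newcell example.com db1.example.com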

Recovering Database Servers


1. Make a copy of the UBIK databases: all /usr/afs/db/*.DB0 files. When saving the UBIK databases, be sure that the UBIK server processes are not running on the machine you are saving from.
2. Install AFS on each new database server machine.
3. For each new database server machine, install a new /usr/afs/etc/ directory. You can copy or ftp the directory from another server machine within the cell. Do not copy any other /usr/afs/ directories.
4. Update the list of database server machines within the /usr/afs/etc/CellServDB and /usr/vice/etc/CellServDB files.
5. Distribute the changes to all other database servers, file servers, and clients.
6. Restart the remaining database servers so they will recognize the new list of database servers.
7. Start the bosserver and create the database server instances (see the sketch after these steps). Once the database server processes are started, they will enter the UBIK quorum and copy the database over to themselves.
8. Reboot the client machines or use fs newcell.
9. Restart the file servers to recognize the new list of database servers. File server restarts can be deferred until an appropriate time.
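As a sketch of step 7, the database server instances could be created with commands like the following. The host name is a placeholder, and the paths assume the default /usr/afs/bin installation location.

 # db3.example.com is a placeholder for the recovered database server.
 bos create db3.example.com kaserver simple /usr/afs/bin/kaserver
 bos create db3.example.com ptserver simple /usr/afs/bin/ptserver
 bos create db3.example.com vlserver simple /usr/afs/bin/vlserver
 bos create db3.example.com buserver simple /usr/afs/bin/buserver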



Recovering an entire AFS Cell

In this scenario, a true disaster has occurred: All database servers and all or most file servers are lost. The cell is down and your business is at a standstill until data is restored from tape. The goal is to get the cell up and running as quickly as possible. Some assumptions made while recovering the cell are as follows:
1. The cell name will remain the same.
2. All machines in the new cell may have new IP addresses.
3. Data is saved using AFS backup or volume dumps.

Recovering Database Servers


1. Install AFS for each database server machine.
2. Restore the /usr/afs/db/ directory to each database server machine. Specifically the UBIK database files: bdb.DB0, kaserver.DB0, prdb.DB0, and vldb.DB0.
3. Restore the /usr/afs/etc/ directory for each database server machine and update the /usr/afs/etc/CellServDB file to reflect the new database servers within the cell.
4. Start the bosserver and create the database server instances on each database server machine. Once the database server processes are started, they will establish UBIK quorum.
5. Change the keys for the cell using bos addkey and kas setpasswd afs (see the sketch after these steps). This is especially important if the original cell is still up; otherwise, both cells will have the same key, and volume restores will remove volumes from the original cell.
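A sketch of step 5; the host name, cell name, and key version number are placeholders, and both commands prompt for the new key or password (which must be the same).

 # Pick a key version number higher than any existing key.
 bos addkey db1.example.com -kvno 100 -cell example.com
 kas setpassword afs -kvno 100 -cell example.com
 # Repeat bos addkey on each server machine, or copy the updated KeyFile to them.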

Recovering File Servers

For each file server machine:


1. Install AFS as a file server machine.
2. Restore the /usr/afs/etc/ directory and update the /usr/afs/etc/CellServDB file to reflect the new database servers within the cell. Copy the new /usr/afs/etc/KeyFile from a database server machine.
3. Start the bosserver and create the fs instance.

Recovering Client Machines

For each client machine:


1. Install AFS for a client machine.
2. Restore the /usr/vice/etc/CellServDB file and update the database server machines for the cell.
3. Restore the /etc/passwd file and any other AFS data (see section AFS Meta-Data).
4. Start the afsd process along with any options needed for this cell.
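For example (a sketch only; the cache tuning values are placeholders that should be adjusted for each machine):

 # Placeholder tuning values; see the afsd documentation for your platform.
 /usr/vice/etc/afsd -stat 2800 -daemons 4 -volumes 128 -nosettime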

Restoring Volumes and Backup Data

Once the UBIK servers, file servers, and butc machines are up, the AFS data can be restored from backup tapes. Restores from tape can proceed using either the backup diskrestore or backup volsetrestore command (described in the section Recovering Volume Data).

    • Core Volumes
    • Two key volumes to restore are "root.afs" and "root.cell". They do not need to be the first volumes to be restored, but are necessary for access to the AFS cell.

      Some administrators may define a subset of volumes that make up the "core" of the business: the core information necessary for the business to survive. Knowing beforehand which core volumes to restore, as well as planning where the core volumes will be mounted in the restored tree, will save time. Creating the mountpoints beforehand saves having to create them after the restore.

    • Readonly Volumes
    • The readonly entries, which reference a server that is no longer in the cell, continue to exist within the VLDB. Accessing such a readonly volume will fail and may even discard your token if the file server from the original cell is still up. Old "root.afs.readonly" and "root.cell.readonly" volumes will render the cell inaccessible.

      The readonly volumes must be removed using vos remsite. The following is an example of how to generate a list of "vos remsite" commands for all RO volumes within the cell:


      vos listvldb | awk '(NF == 1) {v = $1} /RO Site/ {print "vos remsite",$2,$4,v}'

      Once removed, new readonly volumes can be created and released.

    • Volumes not restored
    • Volumes not restored also continue to exist within the VLDB. Accessing these volumes will fail and may even discard your token if the file server from the original cell is still up. These volumes can be removed, removed and recreated, or left alone (accesses will fail).

      Note that if the volume needs to be recreated, appropriate quota and ACLs also need to be set.

    • Backup Volumes
    • Restores do not recreate backup volumes, although the VLDB will continue to say they exist. The next "vos backup" or "vos backupsys" will recreate these backup volumes. The following is an example of how to generate a list of "vos backup" commands for all backup volumes listed within the VLDB:

       vos listvldb | awk '(NF == 1) {v = $1} /Backup:/ {print "vos backup",v}'

Notes

Things to be aware of:

  • Do not run vos syncvldb or vos syncserv during recovery. The commands will remove volumes from the VLDB. The list of volumes in the VLDB is needed for backup diskrestore and backup volsetrestore.
  • Note that the /usr/afs/local/ directory was neither saved nor restored. This information is not needed. The saved BosConfig file is for reference only.
  • Volumes created after the last save of the VLDB are not visible to the volsetrestore command. They will need to be restored individually.



WHAT IFs

What if I want a new cell name?

The KADB contains cell-specific information and must change. The KADB can either be rebuilt using the kadb_check tool (see the section on reading the UBIK databases directly) or changed during installation of the database servers. The following steps describe how to change the KADB during installation of the database servers.


1. Install AFS on each new database server machine.
2. Restore the /usr/afs/db/ directory to each database server machine. Specifically the UBIK database files: bdb.DB0, kaserver.DB0, prdb.DB0, and vldb.DB0.
3. Restore the /usr/afs/etc/ directory for each database server machine and update the /usr/afs/etc/ThisCell and /usr/afs/etc/CellServDB files to reflect the new cell and its database servers.
4. Start the bosserver in noauth mode, and create the database server instances on each database server machine. Once the database server processes are started, they will establish UBIK quorum.
5. Delete the "krbtgt.<CELLNAME>" KADB entry, where <CELLNAME> is the capitalized name of the original cell. Example:
 kas delete krbtgt.AFSDEV.TRANSARC.COM -noauth
6. Create the correct "krbtgt.<CELLNAME>" entry and update its fields. Example:
 kas create krbtgt.DR.TRANSARC.COM -noauth
 kas setfields krbtgt.DR.TRANSARC.COM -flags NOTGS+NOSEAL -noauth
7. Change the password for the "admin" account.
8. Change the key for the cell using bos addkey and kas setpasswd afs. This is important if the original cell is still up. Otherwise, both cells will have the same key and volume restores will remove volumes from the original cell.
9. Restart the database servers in authentication mode.
10. Because all user keys stored in the KADB are based on the user password and the cell name, each user password will need to be changed. The admin account is the only account whose password has already been changed and which you can authenticate as.
11. As installation continues for fileservers and clients, the ThisCell and CellServDB files within /usr/afs/etc/ and /usr/vice/etc/ must also be updated.
12. Once the "root.afs" volume is restored, create a new mountpoint within it to your new cell:
 fs mkmount /afs/<cellname> root.cell -cell <cellname>

The old cell's mountpoint needs to be removed or made into a cross cell mountpoint.
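For example, assuming the old cell's mountpoint still exists at /afs/<oldcellname>, one of the following could be done:

 # Remove the old cell's mountpoint:
 fs rmmount /afs/<oldcellname>
 # or make it a cross-cell mountpoint back to the original cell:
 fs mkmount /afs/<oldcellname> root.cell -cell <oldcellname>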

Mountpoints to volumes within the cell will work unless the mountpoint was created with the "-cell" option. These mountpoints need to be removed and recreated. Mountpoints to volumes in other cells will continue to work.

What if some volumes are not listed in the VLDB?

The VLDB is the primary source for determining which volumes are in the cell. Once a volume is removed from the VLDB, there is no record that the volume exists. Running vos syncvldb or vos syncserv is the most common way many entries get removed from the VLDB. Avoid these commands in recovery situations.

Once this happens, the list of volumes to restore needs to be generated using one of the following methods, and the volumes then restored with the backup volsetrestore -file <filename> command:

  • Generate a list of volumes to restore by searching the backup database. The backup dumpinfo -id <dumpid> command will list all the volumes included in a dump.
  • Generate a list of volumes to restore from an older copy of the volume location database, vldb.DB0 (see the section on reading the UBIK databases directly).

What if a tape I want to restore from is not in the BUDB?

Each of the backup restore commands relies on a volume being listed in the BUDB before it can be restored. The BUDB is needed to determine which tape a volume is on and where on the tape the volume resides.

  • Use backup scantape -dbadd to scan the dumpset and add each of the volume entries to the backup database (see the sketch below); or
  • Extract individual volumes from the tape (see the section on extracting volumes off of an AFS backup tape) and restore them.
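For the scantape approach, a sketch (port offset 0 is an assumption, and a butc process must be running for that port offset):

 # Scan the tape in the drive at port offset 0 and add its dump and volume
 # records to the backup database.
 backup scantape -dbadd -portoffset 0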


What if butc prompts for a tape that is damaged?

If a number of backup tapes were damaged or lost in a disaster, they may no longer be available, yet still be listed in the backup database as available for restore.

In AFS 3.5, butc will allow you to skip a tape by typing 's' at the tape prompt. All remaining dumps that were to be restored from the tape will then be skipped. The butc process will continue to prompt for any remaining tapes.

Prior to AFS 3.5, abort the restore, remove the dump from the backup database (backup deletedump), and then perform the restore again.

Alternatively, abort the restore and then start a date-specific restore dated prior to the lost tape.

Can I read the UBIK databases directly?

Yes. The kadb_check and vldb_check tools can read the KADB and VLDB files (kaserver.DB0 and vldb.DB0) directly. kadb_check can be used to examine or rebuild the KADB, for example when changing the cell name (see the section on changing the cell name), and vldb_check can be used to read an older copy of vldb.DB0, for example to generate a list of volumes to restore (see the section on volumes not listed in the VLDB). As when saving the .DB0 files, the database files should not be in use by a running UBIK server process when these tools are run against them.

Can I restore AFS volume dumps to non-AFS space?

AFS 3.5 will allow you to restore volume dumps to non-AFS space, building a directory tree and populating it with files from the volume dump. The restorevol tool can be run on a machine that does not have AFS installed.

ACL information is lost, and volume mountpoints are converted to symbolic links.


 restorevol [-file <dump file>] [-dir <restore dir>]
[-extension <name extension>] [-mountpoint <mount point root>]
[-umask <mode mask>]

The volume dump is read from stdin or from a file named with the -file option. The volume is recreated within the current directory or the directory given by -dir. The volume's root directory is named after its RW volume name, appended with an optional extension given by -extension. Volume mountpoints are converted to symbolic links referencing their volumes within the directory this volume was restored to (-dir), or within the directory specified by -mountpoint. A umask for the mode bits of all files and directories can be specified with -umask.
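For example (the dump file name and target directory are placeholders):

 # Restore a single volume dump into /tmp/restore with restrictive permissions.
 restorevol -file user.jdoe.dump -dir /tmp/restore -umask 077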

Can I extract volumes off of an AFS backup tape?

AFS 3.5 will allow you to extract volume dumps directly from a backup tape. With the extracted volume dump, you can either restore it to a file server with vos restore, or restore it to non-AFS space with restorevol (see the previous section). The read_tape tool can be run on a machine that does not have AFS installed.


 read_tape -tape <tape device> [-restore <# volumes to restore>]
[-skip <# volumes to skip>] [-file <filename>] [-scan] [-noask]
[-label] [-vheaders] [-verbose]

read_tape reads an AFS backup tape, prompting whether to extract each volume found. The -noask flag suppresses the prompt. Each volume dump is written to a file whose name is the RW name of the volume dumped; the -file option allows you to specify an alternate file name. If you know where on the tape the volume is, you can skip ahead to that spot using the -skip option and restore a specific number of volumes using the -restore option. read_tape does not rewind the tape, allowing you to rerun the tool while the drive is positioned between volume dumps.

The -label and -vheaders flags will display the full dump labels and volume header labels.
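For example (the tape device name is a placeholder):

 # Extract every volume dump on the tape without prompting; each dump is written
 # to a file named after the RW volume.
 read_tape -tape /dev/rmt/0n -noask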




About this document...


AFS Disaster Recovery

This document was presented by John Morin, Advanced Member of Technical Staff, at the Decorum '98 conference in San Antonio, Texas, in March of 1998.

Please contact AFS Product Support with any questions or comments relating to this document.



[{"Product":{"code":"SSXMUG","label":"AFS"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"backup","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"3.6","Edition":"Standard","Line of Business":{"code":"LOB35","label":"Mainframe SW"}}]

Document Information

Modified date:
17 June 2018

UID

swg27004312